Concurrent Algorithms as Space-time Recursion Equations*


Marina C. Chen and Carver A. Mead†


Introduction 


It is by now well recognized that VLSI technology has brought about a medium which
allows the realization of orders of magnitude more computing elements per unit cost.
The more significant contribution of VLSI to Computer Science will be in the utilization
of many hundreds or thousands of these elements concurrently to achieve a given
computation. It is clear from existence proofs of such innovative designs as systolic arrays
[KUNG & LEISERSON80], tree machine algorithms [BROWNING80], computational
arrays [JOHNSSON ET AL.81], wavefront arrays [KUNG,S.Y.80], etc. that vast performance
improvements can be achieved if the design of so-called “high-level” algorithms is released
from the one-dimensional world of a sequential process, and the cost of communications in
space as well as the cost of computation in time is taken into consideration [SUTHERLAND &
MEAD77]. While this higher dimensional design space provides a great playground for
innovative algorithm design, it also introduces pitfalls unapprehended by those accustomed
to the world of a single sequential process. Verification of algorithms becomes much more
crucial in system designs because debugging concurrent programs can very easily become
an exponentially complicated task in this rich space. The real difficulty lies in the high
degree of complexity of concurrent systems. The well-known hierarchical approach can
be used to manage the design complexity for such systems. A system is broken down into
successive levels of sub-systems until each is of a manageable complexity. The effectiveness
of this approach relies on two basic tools: a design and verification methodology for each
level, and an abstraction mechanism to go from one level to the next. The latter is crucially
important, for without it the consistency of the whole system is imperiled.

In this paper, we describe a methodology and a single notation for the specification
and verification of synchronous and self-timed concurrent systems ranging from the level
of transistors to communicating processes. The uniform treatment of these systems results
in a powerful abstraction mechanism which allows management of system complexity.


Traditionally, due to the assumption that the cost of accessing variables in memory is
the same regardless of their locations, sequential algorithms ignore the spatial relationships
of variables. In addition, the steps of a computation have not been explicitly expressed
as a function of time, but are rather implied by programming constructs. Languages
that cannot express the spatial relationships of variables cannot take into account the
most important aspect in the design of a concurrent algorithm, i.e. ensuring locality of
communications, taking advantage of the interplay of variables in space (in practice up to
3 dimensions) to achieve higher performance. The implicit “time” causes programming
languages to suffer either from not being able to abstract the history of computation (e.g.
in applicative and data-flow languages [KAHN74, BACKUS78]), or from not being able to abstract
computation in a clean functional form (e.g. in assignment-based languages). Here we
choose to make “time” an explicit parameter of computation. We call our representation
of computation a “Space-time Algorithm”.


In [CHEN82], CRYSTAL (Concurrent Representation of Your Space Time ALgorithm),
a notation for concurrent programming, is proposed. The fixed-point approach [SCOTT
& STRACHEY71] is used for characterizing the semantics. Within this framework, a
program is expressed as a set of systems of recursion equations. Unknowns of the
equations are data expressed as functions from the space-time domain to the value domain.
For a deterministic concurrent system, such as a systolic array, a single system of
equations results, and the semantics of such a system is defined as the least solution of the
equations. The semantics of concurrent systems in general can be characterized as the
corresponding set of solutions of the set of systems of equations. In this paper, we
concentrate on deterministic concurrent systems at the communicating sequential processes
level. We will first briefly present considerations that are generic to all systems, i.e., the
underlying model of computation, the representation, and the mathematical semantics of
the systems. Various inductive techniques (see for example [MANNA74]) used in verifying
recursive programs can be directly applied in verifying space-time algorithms and proving
their properties. We demonstrate this framework by presenting both the synchronous and
self-timed versions of matrix multiplication on systolic arrays [KUNG & LEISERSON80]
with a proof of correctness. The notion of wavefront is especially important in this class
of computations. We define the “phase” of a computation wave in a way that is analogous
to a wave in the physical world. The set of all possible “phases” can be formalized as a
well-founded set, upon which the inductive proof is based.


Model  of  Computation 


The model consists of an ensemble of sequential processes, each of which has its own
local state and ports for communicating with other processes. Depending on the level of
system concerned, these processes can be as simple as a single transistor or as fancy as a
conventional von Neumann type machine. A sequential process consists of a function that
maps from inputs and current-states to outputs and next-states. Such a function uniquely
defines a sequential process. It is the generator of the output sequence and state, given an
initial state and an input sequence. The state captures the semantic abstraction of the
history. No assertion about the process can depend upon history in a way not captured
by the state. A single invocation of the function i) evaluates the function, ii) updates the
state and outputs, and iii) increments the process’s “time”, all as an atomic event.

In a particular process, “time” is a measure of how many invocations have occurred
and space is where the process is located. “Time” is a property local to each process.
Note that state is explicitly represented; a function is not defined from the history of inputs
to outputs as in the applicative and data-flow models of computation. Communications
among processes in space are specified by identifying inputs of one process with outputs of
other processes in the space-time domain. Also note that a transition from one invocation
to the next within a sequential process can be viewed as a communication in the time
domain (fixed in space).
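As an illustration of this model, the following minimal Python sketch treats a sequential process as a pure step function from (input, current-state) to (output, next-state), with local “time” counting invocations. The names `run` and `accumulator` are hypothetical and are not part of CRYSTAL.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical encoding: a process is a pure function
# (input, current_state) -> (output, next_state).
Step = Callable[[int, int], Tuple[int, int]]


def run(step: Step, state: int, inputs: Iterable[int]) -> Tuple[List[int], int, int]:
    """Generate the output sequence from an initial state and an input sequence."""
    outputs = []
    time = 0  # local "time": the number of invocations so far
    for x in inputs:
        y, state = step(x, state)  # one atomic invocation
        outputs.append(y)
        time += 1
    return outputs, state, time


def accumulator(x: int, state: int) -> Tuple[int, int]:
    """Example process: the state abstracts the input history as a running sum."""
    total = state + x
    return total, total


outs, final_state, invocations = run(accumulator, 0, [1, 2, 3])
```

The state here is the only trace of the input history, which is the sense in which it "captures the semantic abstraction of the history".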

The “slicing” of a sequential process into a sequence of functions is done at the
communication with the external world. Inputs from several different processes which are
aligned in time and used as arguments to a single function are considered as one external
input event, i.e. one invocation. Within the same slice, no side-effects are allowed, i.e.
each slice is strictly functional. We enforce this discipline by using a purely applicative
programming notation (like pure Lisp and Backus’s FP notation) to implement atomic
functions, which cannot be further sliced either in space or in time. Any higher level
system is constructed by composing atomic functions and other existing systems using
recursion equations. The resulting system is always transformed into a function from
inputs and current-states to outputs and next-states, i.e. a sequential process. It is often
the case that once the system is implemented, a sequence of inputs can be conveniently
considered as a set of inputs at the next level up. Although internal state is used as part
of the implementation, the outputs can be expressed as a function of such a sequence
of inputs without referring to the state. In such a case we abstract the process as an
applicative function and it can, once again, be treated as if it were atomic. In real-time
systems, this kind of abstraction is not possible since the sequence of inputs cannot be
treated as a static input, making explicit state still necessary.

Thus space-time algorithms are either purely applicative programs or recursion
equations. Note that in this way, states can be expressed without side-effects. The change
from viewing an applicative system as the universe to using it only for an atom is the key
to the applicability of our framework to real systems. The applicative model of
computing suffers one major drawback in not being able to retain the result of a computation
so that it can be used in a different place or at a later time. The data-flow model is a
remedy for this problem only in space. The essential ability to use a result in several
places is captured by the data-flow equations devised by Kahn [KAHN74]. Unfortunately,
this model still lacks the essential capability of capturing the state, the result in the
time domain. This fact is manifested in the proliferation of assignment-based data-flow
languages [ACKERMAN, W.B.]. The elegance of data-flow equations cannot help the
implementation of real world systems where state is necessary. The space-time recursion
equations using a purely applicative language can be applied to real world problems.
This insight is the most important contribution of our work to computer programming.
We thus retain the elegance and formal cleanness of functional application together with
the essential ability to abstract history into a compact form.

Representation 


Let $x = (x_1, x_2, \ldots, x_m)$ denote the inputs and current-states of a process, where
$x_i \in D_i$ for $i = 1, 2, \ldots, m$. Here $D_i$ is the domain* where the input or current-state $x_i$
assumes its value. We define a function $f$ which maps the vector of inputs and current
states to the vector of outputs and next states:

$$
f : D^m \to D^n, \qquad f = (\lambda x.f_1, \lambda x.f_2, \ldots, \lambda x.f_n) \tag{1}
$$

where $n$ is the total number of outputs and states.

Each component, $\lambda x.f_i$, of such a function is an element of $[D^m \to D]$. These functions
must be monotonic (see for example [MANNA74]) over $D^m$.


In order to capture the notion of flow of the data and the structure of these data,
we define data streams. Each “stream” of data is represented by a function from the
space-time domain $V$ to the value domain $D$. We next define structured processes, which
define the location of the processes making up the ensemble, as a function in the domain
$[V \to [D \to D]]$. This function is defined by cases for different process types in the space-time
domain. Cases are specified in the notation of [DIJKSTRA76]. The complexity of this definition
reflects the heterogeneity of the process types.


The relationships among the structured input and output data streams and structured
processes and functions are defined in a point-wise manner. In the space-time domain, an
output is the result of the application of the function at that point to input data streams
at that point. Through connections, the input (current-state) streams of one function are
identified with output (next-state) streams of other functions, or with initial/boundary
conditions. We therefore define structured connections also as a function in the domain
$[V \to V]$. The description of this function reflects the regularity of the connections. By
substituting the equations of structured connections into those of structured processes
we obtain a system of recursion equations that define output streams in terms of output
streams and initial/boundary conditions.


An obvious restriction on these recursion relations is that the time components can
only increase by one unit at a time, i.e. an argument presented to the input of a function
at its time $t$ will affect the value of that function which appears at its output at its
“time” $t + 1$.


*Technically, domains are not sets but complete lattices with approximation ordering. Please refer to
[MANNA74] and [SCOTT & STRACHEY71].


In general, an input stream at a given point in the space-time domain can connect
to an output stream at any other point in space. In specific cases, such as when
low-cost neighboring communications are used, inputs are connected to outputs of
neighboring processes. In the case of such neighboring connections, the relations are local in all
dimensions; it is then possible to use difference equations for our specification. Recursion
relations retain more information in the sense that the “phase” of a computation wave
is embedded in the description. For more complex situations, involving non-local
connections, the greater generality of recursion relations is essential.

Semantics  and  Abstraction 

By the well-known fixed-point theory [LASSEZ, NGUYEN & SONENBERG82] the
unique minimum solution of any system of recursion equations exists. This minimum
solution is taken to be the function that the system computes. The process of finding this
minimum solution can be described intuitively by the following successive approximation
procedure. We first approximate the solution by the set of $n$ data streams that are totally
undefined in the space-time domain and substitute them into the right-hand side of the
recursion equations. This substitution results in the left-hand side, which is a set of data
streams that are defined only on the points in the space-time domain where initial values are
set. These data streams are the inputs to the algorithm and we refer to them as initial
streams. Now we substitute these initial streams again into the recursion equations and get
another set of data streams that have even more points in the space-time domain defined.
We repeat this process until no more points in the space-time domain become defined.
This process corresponds exactly to the process of computation. The only restriction is
that our functions must be monotonic, i.e. each data stream at any iteration is always
at least as well defined as it was on the previous one. We do not allow non-monotonic
functions that destroy results which have already become defined. Refer to [MANNA74] and
[STOY77] for the formalism.
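The successive approximation procedure can be sketched in Python on a toy system. This is only an illustration under stated assumptions: the one-dimensional "shift" recursion $h(p, t) = h(p+1, t-1)$, the use of `None` for the undefined value, and the names `step` and `least_fixed_point` are all hypothetical and do not come from the paper.

```python
BOTTOM = None  # stands for "undefined" in this sketch


def step(h, init, size, steps):
    """Substitute the current approximation h into the right-hand side."""
    new = {}
    for p in range(size):
        for t in range(steps):
            if t == 0:
                new[(p, t)] = init[p]            # initial values
            elif p + 1 < size:
                new[(p, t)] = h[(p + 1, t - 1)]  # neighbor's value, one time unit earlier
            else:
                new[(p, t)] = BOTTOM             # no neighbor: stays undefined
    return new


def least_fixed_point(init, steps):
    """Iterate from the totally undefined streams until nothing new becomes defined."""
    size = len(init)
    h = {(p, t): BOTTOM for p in range(size) for t in range(steps)}
    rounds = 0
    while True:
        new = step(h, init, size, steps)
        if new == h:
            return h, rounds
        h, rounds = new, rounds + 1


solution, rounds = least_fixed_point([10, 20, 30], steps=3)
```

Each round only adds defined points and never changes an already defined one, which is the monotonicity requirement stated above; in this toy case the number of rounds equals the number of time steps.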

The resulting minimum solution consists of $n$ data streams. Each data stream is a
function over $[V \to D]$. In order to construct a higher level system using the system we
have just obtained, we need to encapsulate the system as a sequential process or a function.
The function defining a sequential process or the single applicative function is from value
domain to value domain. This encapsulation usually involves some transformation from
the original data structure to the space-time domain for the inputs and a corresponding
transformation for the outputs. Technically, the procedure is as follows:

(1) The input mapping function maps all initial/boundary values from an abstract
input data structure to the inputs unconnected to any output in the space-time
domain.

(2) Compute the least fixed point solution in the space-time domain as described
above.

(3) The resulting outputs occur at those points in the space-time domain designated
as outputs of the system. The output mapping function maps these outputs from
an element of $[V \to D]^n$ to an element of the output value domain $D^{n'}$.

The result of this procedure is the abstract definition in the value domain of the
function (type $[D^{m} \to D^{n'}]$) implemented by the space-time algorithm.

Matrix  Multiplication  on  a Systolic  Array — Program  and  Semantics 

In his paper, Kung described various matrix related operations performed on an array
of interconnected hexagonal elements. We present his algorithm for multiplying two full
matrices in CRYSTAL and prove the correctness of the algorithm.

As shown in Figure 1, hexagonal elements are connected into a hexagonal array. Each
element has three inputs and three outputs as shown by the incoming and outgoing arrows,
respectively. Such a process performs an inner product operation in the north and south
direction, i.e. $c_{out} = c_{in} + a_{in} \times b_{in}$, and transmits the other two inputs as they were,
i.e., $a_{out} = a_{in}$, $b_{out} = b_{in}$.


The two matrices to be multiplied, $A$ and $B$, and a matrix $C$ are fed into the array
as shown in the figure. The resulting matrix $C'$ will come out at the top of the array as
shown. Kung's original algorithm assumes a global clock, thus every process performs an
operation synchronously. When data items are fed from the boundaries of the array, due
to the fact that every process is forced to perform an operation even before any meaningful
data reaches the process, proper initialization of the system by padding zeros in the input
streams and disposal of garbage data are necessary. The same algorithm with a different
timing scheme, e.g. a self-timed [SEITZ80] scheme, can conceptually simplify the interaction
of processes and the flow of data, and renders a simpler initiation of the system. This
simplification results from the fact that the self-timed scheme assures that each process
does not perform any operation until all the meaningful data items have reached the
process. On the other hand, the self-timed scheme does not have any global control; the
ordering of the system events is an emergent property of the local synchronizations. Thus
the specification of the ordering relations among invocations of processes has to be verified
from the initial data arrangement, since self-timed elements are triggered by the arrival of data.

Both  algorithms  can  be  described  by  CRYSTAL  programs.  We  will  present  both 
versions  as  examples  of  our  notation  and  verification  methodology  and  discuss  some  of  the 
design  issues  of  synchronous  systems  vs.  self-timed  systems. 


In writing a CRYSTAL program, one needs to choose an appropriate coordinate system
for the processes in space. The data flow of the array has a symmetry which can be
described by the dihedral group of order 3 [LIN & MEAD82]. As shown in Figure 1, the
3-dimensional Cartesian coordinate system is chosen to reflect this symmetry. The center
of this hexagon can be viewed as a corner of a cube. The hexagon is made of the three
faces that contain the corner.


Next we choose a coordinate system in the time domain according to the system
timing scheme. If the system is synchronous, then $t$ denotes the number of system
clock cycles. For a self-timed system, each process has its own time frame. If it is a
deterministic system, there exists a unique partial ordering of all events of the system.
For a nondeterministic system, there exists more than one possible partial ordering of
system events. Thus the synchronous system is a special case of deterministic systems
where the unique partial ordering on events is controlled by the system clock.


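Let $x, y, z, t$ be non-negative integers. Define the following predicates which specify
the location of processes in the space-time domain as shown in Figure 2. For example,
$\varphi_{xy}$ restricts the $xy$ plane to an area within the specified bounds in the first quadrant.

$$
\begin{aligned}
\varphi_{xy} &= (n > x > 0) \land (n > y > 0) \land (z = 0) \\
\varphi_{yz} &= (n > y > 0) \land (n > z > 0) \land (x = 0) \\
\varphi_{zx} &= (n > z > 0) \land (n > x > 0) \land (y = 0) \\
\varphi_{x} &= (n > x > 0) \land (y = 0) \land (z = 0) \\
\varphi_{y} &= (n > y > 0) \land (z = 0) \land (x = 0) \\
\varphi_{z} &= (n > z > 0) \land (x = 0) \land (y = 0) \\
\varphi_{xyz} &= (x = 0) \land (y = 0) \land (z = 0) \\
\varphi_{h} &= \varphi_{xy} \lor \varphi_{yz} \lor \varphi_{zx} \lor \varphi_{x} \lor \varphi_{y} \lor \varphi_{z} \lor \varphi_{xyz} \\
\varphi_{a} &= (0 < |x - y| < 3n) \land (0 < \min(2x - y,\, 2y - x) < 3n) \land (z = 0) \\
\varphi_{b} &= (0 < |z - x| < 3n) \land (0 < \min(2z - x,\, 2x - z) < 3n) \land (y = 0) \\
\varphi_{c} &= (0 < |y - z| < 3n) \land (0 < \min(2y - z,\, 2z - y) < 3n) \land (x = 0) \\
\varphi_{s} &= \varphi_{a} \lor \varphi_{b} \lor \varphi_{c} \\
\varphi_{a'} &= \varphi_{yz} \lor \varphi_{zx} \lor \varphi_{z} \\
\varphi_{b'} &= \varphi_{yz} \lor \varphi_{xy} \lor \varphi_{y} \\
\varphi_{c'} &= \varphi_{xy} \lor \varphi_{zx} \lor \varphi_{x} \\
\varphi_{t} &= 0 \le t < 4(n - 1) + 1
\end{aligned}
$$

We use the notation $\bar{\varphi}$ to indicate the negation of the predicate $\varphi$. Define the space-time
domain of the array $T = \{ (x, y, z, t) \mid \varphi_s \land \varphi_t \}$. From now on, unless otherwise specified,
$(x, y, z, t)$ always refers to any $(x, y, z, t) \in T$.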
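To make the hexagon predicate concrete, here is a small Python sketch that encodes the face, axis and origin predicates and enumerates the points of the hexagon. It assumes the strict inequalities as read from the text ($n > x > 0$, etc.); the helper names `phi_face`, `phi_axis`, `phi_h` and `hexagon_points` are hypothetical.

```python
def phi_face(u, v, w, n):
    """Face predicate: n > u > 0, n > v > 0, w = 0 (e.g. phi_xy with (u, v, w) = (x, y, z))."""
    return 0 < u < n and 0 < v < n and w == 0


def phi_axis(u, v, w, n):
    """Axis predicate: n > u > 0, v = 0, w = 0 (e.g. phi_x with (u, v, w) = (x, y, z))."""
    return 0 < u < n and v == 0 and w == 0


def phi_h(x, y, z, n):
    """Hexagon: the three faces, the three axes, and the origin."""
    return (
        phi_face(x, y, z, n) or phi_face(y, z, x, n) or phi_face(z, x, y, n)
        or phi_axis(x, y, z, n) or phi_axis(y, z, x, n) or phi_axis(z, x, y, n)
        or (x, y, z) == (0, 0, 0)
    )


def hexagon_points(n):
    """All points of the n-by-n-by-n cube that satisfy phi_h."""
    return [
        (x, y, z)
        for x in range(n) for y in range(n) for z in range(n)
        if phi_h(x, y, z, n)
    ]
```

Under these assumptions the hexagon is exactly the set of cube points with at least one zero coordinate, so it contains $n^3 - (n-1)^3 = 3n^2 - 3n + 1$ processes.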


Let $A_{in}$, $B_{in}$ and $C_{in}$ be the input data streams (functions over $[T \to D]$) and $A_{out}$,
$B_{out}$ and $C_{out}$ be the output data streams. Then the following process definitions specify
how each output stream is related to the input streams. For example, each hexagonal
element within the hexagon (when $\varphi_h$ holds) has an inner product element for computing
$c_{out}$ and two delay elements for computing $a_{out}$ and $b_{out}$. The definition below covers
all the elements in the space-time domain in a structured way.


Process  Definition 


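$$
A_{out} = \lambda(x, y, z, t).\begin{cases}
\varphi_a \lor \varphi_{a'} \to A_{in}(x, y, z, t - 1) \\
\text{else} \to \bot
\end{cases} \tag{2a}
$$

$$
B_{out} = \lambda(x, y, z, t).\begin{cases}
\varphi_b \lor \varphi_{b'} \to B_{in}(x, y, z, t - 1) \\
\text{else} \to \bot
\end{cases} \tag{2b}
$$

$$
C_{out} = \lambda(x, y, z, t).\begin{cases}
\varphi_c \land \overline{(\varphi_{yz} \lor \varphi_y \lor \varphi_z)} \to C_{in}(x, y, z, t - 1) \\
\varphi_h \to C_{in}(x, y, z, t - 1) + A_{in}(x, y, z, t - 1) \times B_{in}(x, y, z, t - 1) \\
\text{else} \to \bot
\end{cases} \tag{2c}
$$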


Next we define the connection plans for all the elements. They specify which output
connects to which input in a structured way.

Connections 


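$$
A_{in} = \lambda(x, y, z, t).\begin{cases}
t = 0 \to A_{in}(x, y, z, t) \\
t > 0 \to \begin{cases}
\varphi_a \to A_{out}(x + 1, y + 1, z, t) \\
\varphi_{a'} \to A_{out}(x, y, z - 1, t)
\end{cases} \\
\text{else} \to \bot
\end{cases} \tag{3a}
$$

$$
B_{in} = \lambda(x, y, z, t).\begin{cases}
t = 0 \to B_{in}(x, y, z, t) \\
t > 0 \to \begin{cases}
\varphi_b \to B_{out}(x + 1, y, z + 1, t) \\
\varphi_{b'} \to B_{out}(x, y - 1, z, t)
\end{cases} \\
\text{else} \to \bot
\end{cases} \tag{3b}
$$

$$
C_{in} = \lambda(x, y, z, t).\begin{cases}
t = 0 \to C_{in}(x, y, z, t) \\
t > 0 \to \begin{cases}
\varphi_c \to C_{out}(x, y + 1, z + 1, t) \\
\varphi_{c'} \to C_{out}(x - 1, y, z, t)
\end{cases} \\
\text{else} \to \bot
\end{cases} \tag{3c}
$$

By substituting (3a) into (2a), we obtain

$$
A_{out} = \lambda(x, y, z, t).\begin{cases}
t = 1 \to \begin{cases}
\varphi_a \to A_{in}(x, y, z, 0) \\
\varphi_{a'} \to 0
\end{cases} \\
t > 1 \to \begin{cases}
\varphi_a \to A_{out}(x + 1, y + 1, z, t - 1) \\
\varphi_{a'} \to A_{out}(x, y, z - 1, t - 1)
\end{cases}
\end{cases} \tag{4}
$$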


Similarly, we can substitute (3b) into (2b) and (3a), (3b), (3c) into (2c) to obtain a system
of recursion equations in $A_{out}$, $B_{out}$ and $C_{out}$ and the initial conditions.

These equations define the behavior of the hexagonal array itself, independent of the
input to the array. Next we specify the input and output transformation functions which
relate the structure of the hexagonal array with the abstract data structure of matrices.

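Let $h^{0} = (h^{0}_a, h^{0}_b, h^{0}_c)$ denote the initial streams and $h^{\infty} = (h^{\infty}_a, h^{\infty}_b, h^{\infty}_c)$ denote the
final streams, which are the minimum solution of the above system of equations.
An $n \times n$ matrix of elements from the domain $D$ can be thought of as a function from
pairs of indices to $D$. We denote the input matrices by $A$, $B$ and $C$ and the resulting
matrices by $A'$, $B'$ and $C'$, and define the following domains:

domain of integers from 0 to $n - 1$: $\mathcal{N}$
domain of matrices: $M = [\mathcal{N}^2 \to D]$
domain of data streams: $S = [T \to D]$
domain of transformation functions: $\mathcal{T} = [T \to \mathcal{N}^2]$, $\Gamma = [\mathcal{N}^2 \to T]$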


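We define the input transformation functions $(g_a, g_b, g_c) \in \mathcal{T}^3$:

$$
g_a = (I_a, J_a), \qquad g_b = (I_b, J_b), \qquad g_c = (I_c, J_c)
$$

where

$$
I_a = \lambda(x, y, z, t).\begin{cases}
2y - x \equiv 0 \pmod 3 \to \dfrac{2y - x}{3} \\
\text{else} \to \bot
\end{cases}
\qquad
J_a = \lambda(x, y, z, t).\begin{cases}
2x - y \equiv 0 \pmod 3 \to \dfrac{2x - y}{3} \\
\text{else} \to \bot
\end{cases} \tag{6a}
$$

$$
I_b = \lambda(x, y, z, t).\begin{cases}
2x - z \equiv 0 \pmod 3 \to \dfrac{2x - z}{3} \\
\text{else} \to \bot
\end{cases}
\qquad
J_b = \lambda(x, y, z, t).\begin{cases}
2z - x \equiv 0 \pmod 3 \to \dfrac{2z - x}{3} \\
\text{else} \to \bot
\end{cases} \tag{6b}
$$

$$
I_c = \lambda(x, y, z, t).\begin{cases}
2y - z \equiv 0 \pmod 3 \to \dfrac{2y - z}{3} \\
\text{else} \to \bot
\end{cases}
\qquad
J_c = \lambda(x, y, z, t).\begin{cases}
2z - y \equiv 0 \pmod 3 \to \dfrac{2z - y}{3} \\
\text{else} \to \bot
\end{cases} \tag{6c}
$$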
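A minimal Python sketch of the input transformation $g_a$ may help in reading (6a). It rests on an assumption: the index is taken to be the expression divided by 3 whenever the congruence holds, and `None` stands for $\bot$. The names `index_or_bottom` and `g_a` are hypothetical.

```python
def index_or_bottom(v):
    """Return v / 3 when v is a multiple of 3, else None (standing for bottom)."""
    return v // 3 if v % 3 == 0 else None


def g_a(x, y, z, t):
    """Map a space-time point to a pair of matrix indices (I_a, J_a)."""
    return index_or_bottom(2 * y - x), index_or_bottom(2 * x - y)


on_grid = g_a(3, 3, 0, 0)   # both congruences hold
also_on = g_a(1, 2, 0, 0)   # 2y - x = 3 and 2x - y = 0
padding = g_a(1, 1, 0, 0)   # neither expression is a multiple of 3
```

Because $2y - x \equiv 0 \pmod 3$ holds exactly when $2x - y \equiv 0 \pmod 3$, the two components are defined or undefined together, so a point either carries a matrix element or is padding.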

We use the shorthand notation

$$
g_a(x, y, z, t) = (I_a(x, y, z, t),\, J_a(x, y, z, t)).
$$

Therefore, $g'_a(i, j)$ denotes the component-wise application of $(i, j)$ to the four components
of $g'_a$.

The resulting matrices are defined as follows:

$$
A' = h^{\infty}_a \circ g'_a, \qquad B' = h^{\infty}_b \circ g'_b, \qquad C' = h^{\infty}_c \circ g'_c,
$$

where $h^{\infty}_a, h^{\infty}_b, h^{\infty}_c \in S$ are the final streams, $A', B', C' \in M$, and $g'_a, g'_b, g'_c \in \Gamma$.


We now verify that the above system of recursion equations and the input/output
transformation functions correctly implement the familiar matrix operations, i.e.

$$
A'(i, j) = A(i, j) \tag{9a}
$$

$$
B'(i, j) = B(i, j) \tag{9b}
$$

$$
C'(i, j) = \sum_{k=0}^{n-1} A(i, k) \times B(k, j) + C(i, j) \tag{9c}
$$

where $0 \le i < n$, $0 \le j < n$.

We first verify the following lemmas, which give the final streams (the solution of the
recursion equations) in the space-time domain.


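Lemma A, B:

$$
A_{out}(x, y, z, t) = \begin{cases}
\varphi_a \lor \varphi_{a'} \to A_{in}(x + \max(t - 1 - z, 0),\; y + \max(t - 1 - z, 0),\; \max(z - (t - 1), 0),\; 0) \\
\text{else} \to \bot
\end{cases} \tag{10a}
$$

$$
B_{out}(x, y, z, t) = \begin{cases}
\varphi_b \lor \varphi_{b'} \to B_{in}(x + \max(t - 1 - y, 0),\; \max(y - (t - 1), 0),\; z + \max(t - 1 - y, 0),\; 0) \\
\text{else} \to \bot
\end{cases} \tag{10b}
$$

Lemma C:

Let $U_1 = t - 1 + k - y$, $U_2 = t - 1 + k - z$, $V_1 = (t - 1 - x - k) - (y + k)$,
$V_2 = (t - 1 - x - k) - (z + k)$, $K_1 = 1 - \min(x, t - 1)$ and $K_2 = \min(n - 1 - y,\, n - 1 - z,\, t - 1 - x)$.
Define

$$
\begin{aligned}
S_1 = \sum_{k=K_1}^{0} \; & A_{in}(x + k + \max(U_2, 0),\; y + \max(U_2, 0),\; \max(-U_2, 0),\; 0) \\
\times\; & B_{in}(x + k + \max(U_1, 0),\; \max(-U_1, 0),\; z + \max(U_1, 0),\; 0)
\end{aligned}
$$

$$
\begin{aligned}
S_2 = \sum_{k=0}^{K_2} \; & A_{in}(\max(V_2, 0),\; y + k + \max(V_2, 0),\; \max(-V_2, 0),\; 0) \\
\times\; & B_{in}(\max(V_1, 0),\; \max(-V_1, 0),\; z + k + \max(V_1, 0),\; 0),
\end{aligned}
$$

then

$$
C_{out}(x, y, z, t) = \begin{cases}
\varphi_c \lor \varphi_{c'} \to C_{in}(\max(x - (t - 1), 0),\; y + \max(t - 1 - x, 0),\; z + \max(t - 1 - x, 0),\; 0) \\
\qquad\qquad\quad + S_1(x, y, z, t) + S_2(x, y, z, t) \\
\text{else} \to \bot
\end{cases}
$$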

These relate outputs at any point in the space-time domain to the initial input streams.
Since they are total functions in the space-time domain, the solution of the equations is
automatically the minimum. Thus one way of verifying them is simply to substitute them into
the recursion equations and check that the equations hold. The simple substitution technique
will not work in general, since the final streams are not necessarily total in the space-time
domain. In this case, an inductive proof showing that the final streams are the minimum
solution is necessary.
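The substitution check can be mimicked numerically for Lemma A. The Python sketch below compares the recursion for $A_{out}$ with the closed form on a small grid. It is a simplification under stated assumptions, not the proof of [CHEN82]: the region predicates are replaced by "$z = 0$" (incident part) and "$z \ge 1$" (reflected part) on an unbounded grid, the initial stream `a0` is an arbitrary hypothetical choice that is zero off the $z = 0$ plane, and all function names are illustrative.

```python
from functools import lru_cache


def a0(x, y, z):
    """Hypothetical initial stream A_in(x, y, z, 0): arbitrary on z = 0, zero elsewhere."""
    return 100 * x + y + 1 if z == 0 else 0


@lru_cache(maxsize=None)
def a_out_rec(x, y, z, t):
    """A_out by the recursion: shift along (-1, -1, 0) on z = 0, along (0, 0, +1) for z >= 1."""
    if t == 1:
        return a0(x, y, z)
    if z == 0:
        return a_out_rec(x + 1, y + 1, z, t - 1)
    return a_out_rec(x, y, z - 1, t - 1)


def a_out_closed(x, y, z, t):
    """Closed form in the style of Lemma A."""
    d = max(t - 1 - z, 0)
    return a0(x + d, y + d, max(z - (t - 1), 0))


def lemma_a_holds(size, steps):
    """Check that recursion and closed form agree on a size^3 grid for t = 1..steps."""
    return all(
        a_out_rec(x, y, z, t) == a_out_closed(x, y, z, t)
        for x in range(size) for y in range(size) for z in range(size)
        for t in range(1, steps + 1)
    )
```

Calling `lemma_a_holds(4, 8)` exercises both the incident and the reflected part of the wave; a finite check of this kind only builds confidence and does not replace the substitution argument or the inductive proof.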

The computation waves of such a system are very instructive in such proofs. We
observe that there are two triangular waves: one is an incident wave proceeding toward the
origin, the other is a reflected wave proceeding outward from the origin. We define the
phase of the reflected wave to be $x + y + z - t$ and that of the incident wave to be
$x + y + z + 2t$. A wavefront is defined in the traditional way as the locus of all points
of the wave having the same phase. With $t$ fixed, there are many such wavefronts
spread out in space. In this particular system, we number these wavefront positions by
$w = x + y + z$. Notice that the partial result of a particular element of a matrix is
carried on by a single wavefront with one value of phase. Intuitively, the induction for
the reflected wave is on the wavefront from phase $\phi$ and $w = k$ to phase $\phi$ and $w = k + 1$.
For the incident wave, induction proceeds from $w = k$ to $w = k - 2$.

The set of pairs $(\phi, w)$ can be formalized as a well-founded set (a set with no infinite
decreasing sequences, so that induction is valid on the set) with the binary relation $\prec$, which is
defined as the transitive closure of $\prec^*$, the binary relation defined below only on
neighboring elements:

incident wave: $(\phi_1, w_1) \prec^* (\phi_2, w_2)$ if $\phi_1 = \phi_2$ and $w_1 = w_2 + 2$.

reflected wave: $(\phi_1, w_1) \prec^* (\phi_2, w_2)$ if $\phi_1 = \phi_2$ and $w_1 = w_2 - 1$.
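The role of the phase can be illustrated by following one datum through space-time. The Python sketch below assumes that an incident datum moves by $(-1, -1, 0)$ per time step and a reflected datum by $(0, 0, +1)$, as for the $A$ stream; the names `follow`, `incident_phase` and `reflected_phase` are hypothetical.

```python
def incident_phase(x, y, z, t):
    """Phase of the incident wave: x + y + z + 2t."""
    return x + y + z + 2 * t


def reflected_phase(x, y, z, t):
    """Phase of the reflected wave: x + y + z - t."""
    return x + y + z - t


def follow(point, step, n_steps):
    """Trace a datum that moves by `step` in space at each unit of time."""
    x, y, z, t = point
    trace = [point]
    for _ in range(n_steps):
        x, y, z, t = x + step[0], y + step[1], z + step[2], t + 1
        trace.append((x, y, z, t))
    return trace


incident = follow((5, 5, 0, 0), (-1, -1, 0), 3)   # moving toward the origin
reflected = follow((2, 2, 0, 3), (0, 0, 1), 3)    # moving outward along z

incident_w = [x + y + z for (x, y, z, t) in incident]
reflected_w = [x + y + z for (x, y, z, t) in reflected]
```

Along each trace the phase stays constant while the wavefront position $w$ changes by $-2$ (incident) or $+1$ (reflected) per step, which is the ordering used by the relation $\prec^*$.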

The inductive proof is given in [CHEN82].

By composing the input and output transformation functions with the initial and
final streams, we obtain (9a), (9b), and (9c). The detailed proof is in [CHEN82].

Self-timed  Systems 

A self-timed system differs from a synchronous system in the sequencing of system
events. In a synchronous system, all processes are activated simultaneously by the same
clock cycle, i.e. all processes have the same invocation number. The invocations of each
process are ordered by the linear sequence of the clock. Thus all the invocations $(s, t)$,
where $s$ is the space parameter and $t$ is the time parameter, have a unique partial ordering
$\prec$ defined to be

$$
(s_1, t_1) \prec (s_2, t_2) \quad \text{if } s_1 = s_2 \text{ and } t_1 < t_2 .
$$


In a self-timed system, the ordering of the system events is not self-evident as in the
synchronous system. Each process is invoked only when all of its inputs are ready. Thus
the overall system timing is an emergent property of the ensemble, arising from the local
synchronization of all the processes in the system. In such a system, the "time" component
of the space-time domain is a function of the space component, i.e., each process has its
own time-frame. The relation between the time-coordinates of two communicating processes
needs to be asserted in the connection plans. This time-domain relationship among
processes must be verified, since it depends on the initial data arrangement: self-timed
elements are triggered by the input data.
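The firing rule described above can be illustrated with a minimal Python sketch. It is not part of the paper's notation: the function name `fire_when_ready`, the use of FIFO queues for the inputs, and the `combine` argument are all assumptions made for illustration. The sketch shows a single process that is invoked only while every one of its inputs holds a value, and that counts its invocations in its own local time-frame.

```python
from collections import deque

def fire_when_ready(inputs, combine):
    """Invoke one process as long as all of its input queues hold a token.

    inputs  -- list of deques, one per input of the process (assumed FIFO)
    combine -- hypothetical function applied to one token from each input
    Returns a list of (local invocation number, result) pairs.
    """
    if not inputs:
        raise ValueError("a process needs at least one input")
    results = []
    invocation = 0  # local time-frame of this process
    while all(inputs):  # an empty deque is falsy, so this waits on every input
        args = [queue.popleft() for queue in inputs]
        results.append((invocation, combine(*args)))
        invocation += 1
    return results

# Two inputs of unequal length: the process fires twice, then stalls.
a_stream = deque([1, 2, 3])
b_stream = deque([10, 20])
history = fire_when_ready([a_stream, b_stream], lambda a, b: a + b)
```

In this sketch the number of invocations is determined entirely by the data that arrive, not by a global clock, which is why the relation between the time-frames of communicating processes has to be stated and verified separately.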

The following is the space-time algorithm for the self-timed matrix multiplication on
the systolic array. We define a few more predicates to specify where the initial data will
be put. Notice that this set of predicates covers much less area than the set $\varphi_a$, $\varphi_b$ and
$\varphi_c$, since the self-timed algorithm does not need padding zeros in the input data streams.

$$\varphi_{\bar a} = (0 < |x - y| < n) \wedge (0 < \min(x, y) < n) \wedge (z = 0)$$

$$\varphi_{\bar b} = (0 < |z - x| < n) \wedge (0 < \min(z, x) < n) \wedge (y = 0)$$

$$\varphi_{\bar c} = (0 < |y - z| < n) \wedge (0 < \min(y, z) < n) \wedge (x = 0)$$

We  also  redefine 

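$$\varphi_{\text{[illegible]}} = \varphi_{\bar a} \vee \varphi_{\bar b} \vee \varphi_{\bar c}$$

$$\text{[illegible]}(x, y, z) = 0 < t < n - 1 - \max(x, y, z)$$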

Process  Definition 


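$$A_{out} = \lambda(x, y, z, t).\begin{cases} \varphi_{\bar a} \vee \varphi_{a'} \rightarrow A_{in}(x, y, z, t(x, y, z)) \\ \text{else} \rightarrow \bot \end{cases} \tag{12a}$$

$$B_{out} = \lambda(x, y, z, t).\begin{cases} \varphi_{\bar b} \vee \varphi_{b'} \rightarrow B_{in}(x, y, z, t(x, y, z)) \\ \text{else} \rightarrow \bot \end{cases} \tag{12b}$$

$$C_{out} = \lambda(x, y, z, t).\begin{cases} \varphi_{\text{[illegible]}} \wedge (\varphi_{yz} \vee \text{[illegible]} \vee \varphi_z) \rightarrow C_{in}(x, y, z, t(x, y, z)) \\ \varphi_h \rightarrow C_{in}(x, y, z, t(x, y, z)) + A_{in}(x, y, z, t(x, y, z)) \times B_{in}(x, y, z, t(x, y, z)) \\ \text{else} \rightarrow \bot \end{cases} \tag{12c}$$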


Connection  Plans 


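$$A_{in} = \lambda(x, y, z, t).\begin{cases} t = 0 \rightarrow a_i(x, y, z, t(x, y, z)) \\ t > 0 \rightarrow \begin{cases} \varphi_a \rightarrow A_{out}(x + 1, y + 1, z, t(x, y, z) - 1) \\ \varphi_{a'} \rightarrow A_{out}(x, y, z - 1, t(x, y, z)) \\ \text{else} \rightarrow \bot \end{cases} \end{cases} \tag{13a}$$

$$B_{in} = \lambda(x, y, z, t).\begin{cases} t = 0 \rightarrow b_i(x, y, z, t(x, y, z)) \\ t > 0 \rightarrow \begin{cases} \varphi_b \rightarrow B_{out}(x + 1, y, z + 1, t(x, y, z) - 1) \\ \varphi_{b'} \rightarrow B_{out}(x, y - 1, z, t(x, y, z)) \\ \text{else} \rightarrow \bot \end{cases} \end{cases} \tag{13b}$$

$$C_{in} = \lambda(x, y, z, t).\begin{cases} t = 0 \rightarrow c_i(x, y, z, t(x, y, z)) \\ t > 0 \rightarrow \begin{cases} \varphi_c \rightarrow C_{out}(x, y + 1, z + 1, t(x, y, z) - 1) \\ \varphi_{c'} \rightarrow C_{out}(x - 1, y, z, t(x, y, z)) \\ \text{else} \rightarrow \bot \end{cases} \end{cases} \tag{13c}$$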


Input transformation functions: The functions $(g_a, g_b, g_c) \in T^3$ map from the space-
time domain $V$ to the matrix indices $N^2$ as specified in (5).

$$g_a = (\lambda(x, y, z, t).y,\ \lambda(x, y, z, t).x)$$

$$g_b = (\lambda(x, y, z, t).x,\ \lambda(x, y, z, t).z)$$

$$g_c = (\lambda(x, y, z, t).y,\ \lambda(x, y, z, t).z)$$


The  initial  streams  are  defined  by  using  the  composition  of  the  input  transformation 
functions  and  the  matrix  function. 


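$$a_i = \lambda(x, y, z, t).\begin{cases} t = 0 \rightarrow \begin{cases} \varphi_{\bar a} \rightarrow (A \circ g_a)(x, y, z, t) \\ \text{else} \rightarrow \bot \end{cases} \\ t > 0 \rightarrow \bot \end{cases}$$

$$b_i = \lambda(x, y, z, t).\begin{cases} t = 0 \rightarrow \begin{cases} \varphi_{\bar b} \rightarrow (B \circ g_b)(x, y, z, t) \\ \text{else} \rightarrow \bot \end{cases} \\ t > 0 \rightarrow \bot \end{cases}$$

$$c_i = \lambda(x, y, z, t).\begin{cases} t = 0 \rightarrow \begin{cases} \varphi_{\bar c} \rightarrow (C \circ g_c)(x, y, z, t) \\ \text{else} \rightarrow \bot \end{cases} \\ t > 0 \rightarrow \bot \end{cases} \tag{14}$$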
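The composition of an input transformation function with a matrix function can be sketched in Python as follows. This is an illustration only: the helper `initial_stream`, the `placed` predicate argument, and the use of `None` to stand for the undefined value $\bot$ are assumptions, and the placement predicate used in the example is a hypothetical stand-in rather than the paper's predicate.

```python
def g_a(x, y, z, t):
    """Input transformation for A: space-time point -> (row, column)."""
    return (y, x)

def g_b(x, y, z, t):
    """Input transformation for B."""
    return (x, z)

def g_c(x, y, z, t):
    """Input transformation for C."""
    return (y, z)

def initial_stream(matrix, g, placed):
    """Compose the matrix function with g, defined only at t = 0.

    matrix -- nested list indexed as matrix[row][column]
    g      -- one of the input transformation functions above
    placed -- hypothetical predicate on (x, y, z) saying where data are put
    """
    def stream(x, y, z, t):
        if t == 0 and placed(x, y, z):
            row, column = g(x, y, z, t)
            return matrix[row][column]
        return None  # stands for the undefined value (bottom)
    return stream

# Hypothetical 2 x 2 example with data placed on the plane z = 0.
A = [[1, 2], [3, 4]]
on_plane = lambda x, y, z: z == 0 and 0 <= x < 2 and 0 <= y < 2
a_init = initial_stream(A, g_a, on_plane)
```

Here `a_init(1, 0, 0, 0)` looks up row 0, column 1 of `A`, while any point with `t > 0` or off the assumed placement plane yields `None`.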


Output Transformation Functions: The functions $(g'_a, g'_b, g'_c) \in T'^3$ define the space-
time coordinates associated with each element of the output matrix.

$$g'_a = (X_a, Y_a, Z_a, T_a), \quad g'_b = (X_b, Y_b, Z_b, T_b), \quad g'_c = (X_c, Y_c, Z_c, T_c).$$


(2) All outputs of the previous invocation have been taken before another invocation
is initiated.

(3) All inputs to an invocation $t(x, y, z)$ come from invocations $t(x, y, z)$ and
$t(x, y, z) - 1$, depending upon location.

We consider a process at $(x, y, z)$ with $k_0 = \frac{x + y + z}{3} + t(x, y, z)$. From Figure 1, it has
three inputs $a_{in}$, $b_{in}$ and $c_{in}$ from the following neighboring processes respectively.


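$$a_{in}: \begin{cases} \varphi_a \rightarrow (x + 1, y + 1, z) \\ \varphi_{a'} \rightarrow (x, y, z - 1) \end{cases}$$

$$b_{in}: \begin{cases} \varphi_b \rightarrow (x + 1, y, z + 1) \\ \varphi_{b'} \rightarrow (x, y - 1, z) \end{cases}$$

$$c_{in}: \begin{cases} \varphi_c \rightarrow (x, y + 1, z + 1) \\ \varphi_{c'} \rightarrow (x - 1, y, z) \end{cases} \tag{17}$$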


Since $k_0 - 1 < k_0$, the hypothesis is true for this same process at $t(x, y, z) - 1$, the
previous invocation. By the induction hypothesis (16), this invocation takes its inputs
from the above processes at their time $t(x, y, z) - 1$ or $t(x, y, z) - 2$, depending on where
the process is located.

This  process  provides  outputs  to  the  following  neighboring  processes. 


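$$a_{out}: \begin{cases} \varphi_{\text{[illegible]}} \wedge (x > 0) \wedge (y > 0) \rightarrow (x - 1, y - 1, z) \\ \varphi_{a'} \vee \varphi_x \vee \varphi_y \rightarrow (x, y, z + 1) \end{cases}$$

$$b_{out}: \begin{cases} \varphi_{\text{[illegible]}} \wedge (x > 0) \wedge (z > 0) \rightarrow (x - 1, y, z - 1) \\ \varphi_{b'} \vee \varphi_x \vee \varphi_z \rightarrow (x, y + 1, z) \end{cases}$$

$$c_{out}: \begin{cases} \varphi_{\text{[illegible]}} \wedge (z > 0) \wedge (y > 0) \rightarrow (x, y - 1, z - 1) \\ \varphi_{c'} \vee \varphi_z \vee \varphi_y \rightarrow (x + 1, y, z) \end{cases}$$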


We assert that these outputs from the invocation number $t(x, y, z) - 1$ of process $(x, y, z)$
are taken by either invocation number $t(x, y, z) - 1$ or $t(x, y, z)$ of these neighboring
processes, depending upon their locations. Since

$$\frac{x + y + z + 1}{3} + (t(x, y, z) - 1) = \frac{(x - 1) + (y - 1) + z}{3} + (t(x, y, z) - 1) + 1 = k_0 - \frac{2}{3} < k_0,$$

the induction hypothesis can be applied. Process $(x, y, z)$ is ready to start a new
invocation once all of its three inputs are ready. Since the processes that provide output
to it at their respective time frame $t(x, y, z) - 1$ or $t(x, y, z)$ satisfy the following inequality

$$\frac{(x + 1) + (y + 1) + z}{3} + t(x, y, z) - 1 = \frac{x + y + (z - 1)}{3} + t(x, y, z) = k_0 - \frac{1}{3} < k_0,$$

the induction hypothesis applies to them. Process $(x, y, z)$ has three and only three inputs
ready at their respective locations after its invocation number $t(x, y, z) - 1$. The align-
element ensures that no invocation can occur before all three inputs are ready, thus process
$(x, y, z)$ has its invocation number $t(x, y, z)$ occurring. This proves the above assertions. $\Box$
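The arithmetic behind the induction step can be checked mechanically. The following Python sketch is an illustration, not part of the paper: the names `wavefront`, `wavefront_gaps` and `INPUT_SOURCES` are assumptions, and the offsets are read off the six neighbor cases in (17), with the diagonal neighbors taken at the previous invocation and the axial neighbors at the same invocation number. It computes $k = \frac{x+y+z}{3} + t$ exactly and reports how far each input source lies below the receiving invocation.

```python
from fractions import Fraction

# (dx, dy, dz, dt) offsets of the possible input sources, following (17).
INPUT_SOURCES = {
    "a": [(1, 1, 0, -1), (0, 0, -1, 0)],
    "b": [(1, 0, 1, -1), (0, -1, 0, 0)],
    "c": [(0, 1, 1, -1), (-1, 0, 0, 0)],
}

def wavefront(x, y, z, t):
    """Wavefront number k = (x + y + z)/3 + t, kept as an exact fraction."""
    return Fraction(x + y + z, 3) + t

def wavefront_gaps(x, y, z, t):
    """Return k0 - k for every possible input source of invocation (x, y, z, t)."""
    k0 = wavefront(x, y, z, t)
    gaps = {}
    for name, offsets in INPUT_SOURCES.items():
        for dx, dy, dz, dt in offsets:
            gaps[(name, (dx, dy, dz, dt))] = k0 - wavefront(x + dx, y + dy, z + dz, t + dt)
    return gaps

gaps = wavefront_gaps(2, 3, 4, 5)
```

Every gap comes out as exactly one third, independent of the chosen point, so each input source has a strictly smaller wavefront number and the induction hypothesis is applicable to it.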

This algorithm is also deterministic, for we can define a partial ordering $\prec$ on the
invocations of processes. This binary relation is defined as the transitive closure of the
binary relation $\prec^*$ as follows.


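$$(x_i, y_i, z_i, t(x_i, y_i, z_i)) \prec^* (x_0, y_0, z_0, t(x_0, y_0, z_0)), \quad i = 1, 2, 3,$$

such that there exist $(x_1, y_1, z_1, t(x_1, y_1, z_1))$, $(x_2, y_2, z_2, t(x_2, y_2, z_2))$ and $(x_3, y_3, z_3, t(x_3, y_3, z_3))$ with

$$\begin{cases} \varphi_{a'} \rightarrow (x_1 = x_0) \wedge (y_1 = y_0) \wedge (z_1 = z_0 - 1) \wedge (t(x_1, y_1, z_1) = t(x_0, y_0, z_0)) \\ \varphi_a \rightarrow (x_1 = x_0 + 1) \wedge (y_1 = y_0 + 1) \wedge (z_1 = z_0) \wedge (t(x_1, y_1, z_1) = t(x_0, y_0, z_0) - 1) \end{cases} \tag{18a}$$

and

$$\begin{cases} \varphi_{b'} \rightarrow (x_2 = x_0) \wedge (y_2 = y_0 - 1) \wedge (z_2 = z_0) \wedge (t(x_2, y_2, z_2) = t(x_0, y_0, z_0)) \\ \varphi_b \rightarrow (x_2 = x_0 + 1) \wedge (y_2 = y_0) \wedge (z_2 = z_0 + 1) \wedge (t(x_2, y_2, z_2) = t(x_0, y_0, z_0) - 1) \end{cases} \tag{18b}$$

and

$$\begin{cases} \varphi_{c'} \rightarrow (x_3 = x_0 - 1) \wedge (y_3 = y_0) \wedge (z_3 = z_0) \wedge (t(x_3, y_3, z_3) = t(x_0, y_0, z_0)) \\ \varphi_c \rightarrow (x_3 = x_0) \wedge (y_3 = y_0 + 1) \wedge (z_3 = z_0 + 1) \wedge (t(x_3, y_3, z_3) = t(x_0, y_0, z_0) - 1) \end{cases} \tag{18c}$$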


This definition can be derived from (16), with the existence of the align-element for the
inputs of each process as an assumption.


Now  we  proceed  to  verify  the  algorithm  by  proving  two  lemmas. 

Lemma  a,b: 

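$$A_{out}(x, y, z, t) = \begin{cases} \varphi_{\bar a} \vee \varphi_{a'} \rightarrow A_{in}(x + t, y + t, 0, 0) \\ \text{else} \rightarrow \bot \end{cases} \tag{19a}$$

$$B_{out}(x, y, z, t) = \begin{cases} \varphi_{\bar b} \vee \varphi_{b'} \rightarrow B_{in}(x + t, 0, z + t, 0) \\ \text{else} \rightarrow \bot \end{cases} \tag{19b}$$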


Lemma  C: 


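$$C_{out}(x, y, z, t) = \begin{cases} \text{[illegible]} \rightarrow C_{in}(\max(x - t, 0),\ y + \max(t - x, 0),\ z + \max(t - x, 0),\ 0) \\ \qquad +\ \sum_{k\ \text{[bounds illegible]}} A_{in}(k, y + t, 0, 0) \times B_{in}(k, 0, z + t, 0) \end{cases}$$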


Similar to the synchronous case, we can either prove by direct substitution or by
induction on $K$, the set of wavefront numbers. By composing these lemmas with the input and
output transformation functions, we can derive (9a), (9b) and (9c).

From both algorithms, we observe that the input and output transformation functions
and the semantics of the hexagonal array are much simpler for the self-timed version. This
result is not accidental, for the interaction among flows of data for this particular algorithm
only utilizes one third of the maximum space-time resources. In the self-timed version, only
one third of the processes (all processes with the same $k = \frac{x + y + z}{3} + t(x, y, z)$) are active at
any instant. In the synchronous version, all processes are active at all times, thus padding
zeros are necessary. The simplicity of the self-timed version is a pay-off of the more
sophisticated synchronization method. It is necessary to prove that local synchronization
gives rise to the global sequencing relations among all the processes. Describing these two
algorithms in CRYSTAL not only shows the capability of our framework but also provides
many insights into the complexity of various aspects of these two different timing
schemes. We have achieved one of the important goals of this research by providing a
formalism in which one can gain a much deeper understanding of the subject one describes
in the process of so doing.
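The "one third" observation follows from the wavefront number alone: since $t(x, y, z)$ is an integer, a process can share a given value of $k$ only with processes whose $x + y + z$ has the same remainder modulo 3. The Python sketch below counts processes per remainder class. It is an illustration under a loud assumption: the coordinate set used here (all points of an $n \times n \times n$ grid with at least one zero coordinate) is a hypothetical stand-in for the hexagonal array, and the name `activity_classes` is not from the paper.

```python
from collections import Counter
from itertools import product

def activity_classes(n):
    """Count processes by (x + y + z) mod 3 over an assumed coordinate set.

    Processes in the same class can share a wavefront number k, so at most
    one class is active at any given instant of a self-timed run.
    """
    counts = Counter()
    for x, y, z in product(range(n), repeat=3):
        if min(x, y, z) == 0:  # assumed layout: at least one coordinate is zero
            counts[(x + y + z) % 3] += 1
    return counts

classes = activity_classes(3)
```

For `n = 3` the assumed layout has 19 processes, split into classes of 7, 6 and 6, so each class holds roughly one third of the array.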

Conclusion 

We  have  presented  a notation  and  formal  semantics  for  general  non-linear  systems 
with  memory.  An  essential  part  of  the  semantics  is  a methodology  for  abstracting  the 
behavior  of  such  systems  so  they  can  be  used  as  components  at  a higher  level.  The 
semantics  of  a particular  system  consists  of 

(i)  An  input  mapping  function  from  the  value  domain  to  the  space-time  structure 
of  the  system. 

(ii)  A function  in  the  space-time  domain  which  completely  defines  the  operation  of 
the  system. 

(iii) An output mapping function from the space-time structure to the value domain.

The  abstract  semantics  of  the  system  is  obtained  by  eliminating  space-time  variables 
to  yield  a function  in  the  value  domain  alone.  When  it  is  possible  to  eliminate  all 
intermediate  variables,  as  it  was  with  the  Kung  array,  the  abstract  system  is  purely 
functional. When some intermediate state variables remain, as in the case where real-time
input is necessary, the system is defined by an abstract sequential process. Such a process
is defined by a system of recursion relations in time. From an engineering point of view,
the input and output mapping functions serve as precise interface specifications for the
system.

The methodology can be applied to any system: linear, non-linear, time-varying,
history-dependent. We believe it provides, for the first time, a unified approach spanning
the range from computer programs to linear transfer functions, and from transistor circuits
to high level communicating sequential processes.